Abstract

Professional translators often dictate their translations orally 
and have them typed afterwards. The TransTalk project aims 
at automating the second part of this process. Its originality as 
a dictation system lies in the fact that both the acoustic signal 
produced by the translator and the source text under transla- 
tion are made available to the system. Probable translations 
of the source text can be predicted and these predictions used 
to help the speech recognition system in its lexical choices. We 
present the results of the first prototype, which show a marked 
improvement in the performance of the speech recognition task 
when translation predictions are taken into account. 



1 Introduction 

The integration of machine translation and speech technology
is currently the focus of major projects in several countries
[10]. Usually, the aim of these efforts is some type
of speech-to-speech translation, where speech recognition, ma- 
chine translation and speech synthesis are performed sequen- 
tially. However, both speech recognition and machine trans- 
lation are tasks that can at present be reliably accomplished 
only under stringent lexical, syntactic and semantic restric- 
tions, and consequently developers of speech-to-speech transla- 
tion systems need to find application domains for which narrow 
sub-languages can be naturally defined. 

In the TransTalk project, we attempt to integrate speech 
recognition and machine translation in a way which, instead of 
compounding the weaknesses of both technologies, makes max- 
imal use of their complementary strengths. We do not try to re- 
place the human translator by a machine (a hopeless endeavor, 
in general), but undertake instead the more realistic task of 
providing a dictation tool to the translator. Our aim is to use 
machine translation to make probabilistic predictions of the 
possible target language verbalizations freely produced by the 
translator, and to use these predictions to reduce the difficulty 
of the speech recognition task to such an extent that complete 
recognition of the translator's utterances can be achieved.¹



* Present address: Rank Xerox Research Centre, 6 chemin de 
Maupertuis, 38240 Meylan, France. 

1 The idea was independently advanced by us [4] and by researchers
at the IBM Thomas J. Watson Research Center.
Figure 1: TransTalk's underlying model. Starting from an English
sentence e, the translator mentally formulates its French
translation f, then produces its acoustic rendering s. The system's
aim is to find $\hat{f} = \arg\max_f p(f \mid e, s)$, or equivalently,
from Bayes's formula, $\hat{f} = \arg\max_f p(s \mid e, f) \cdot p(f \mid e)$.
By neglecting the influence of e on s once f is known, we can take
$\hat{f} = \arg\max_f p(s \mid f) \cdot p(f \mid e)$.



For example, suppose that, in the case of English-to-French 
translation, the translator decides to render the sentence "what 
splendid horses you have" as "tes chevaux sont vraiment magnifiques".
A speech recognition system without access to
the source text might have difficulty distinguishing chevaux 
(horses) from the acoustically close, and contextually more 
likely, cheveux (hair). On the other hand, the presence in the 
English source of the word horses serves as a strong indicator 
that the correct choice should be chevaux, and it is on such 
knowledge of probable translations that TransTalk attempts to 
capitalize. 

Conceptually, the main difference between a conventional 
"noisy channel" speech recognition system for French and 
TransTalk is that, instead of maximizing in f the product 
p(s | f) · p(f) of an "acoustic model" and a "language model"
for French (where s stands for the acoustic signal and f for the
French sentence), we maximize the product p(s | f) · p(f | e) of
an acoustic model and a "translation model" from English to
French (where e stands for the English sentence under translation).
See figure 1.
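
As an illustration of the difference between the two objectives, the following Python sketch rescores the chevaux/cheveux pair from the example above. All probability values are invented for illustration, and the function name `best` is hypothetical; only the form of the two products comes from the text.

```python
import math

# Hypothetical acoustic scores p(s | f) for two near-homophones (illustrative values).
acoustic = {"chevaux": 0.45, "cheveux": 0.55}
# Hypothetical French language-model scores p(f), which ignore the source text.
language = {"chevaux": 0.2, "cheveux": 0.8}
# Hypothetical translation-model scores p(f | e) when e contains "horses".
translation = {"chevaux": 0.9, "cheveux": 0.1}


def best(candidates, acoustic_model, prior_model):
    """Return the candidate f maximizing log p(s | f) + log prior(f)."""
    return max(
        candidates,
        key=lambda f: math.log(acoustic_model[f]) + math.log(prior_model[f]),
    )


# Conventional recognizer: acoustic model times language model.
conventional = best(acoustic, acoustic, language)
# TransTalk-style recognizer: acoustic model times translation model.
transtalk = best(acoustic, acoustic, translation)
```

With these toy numbers the language-model objective prefers cheveux, while the translation-model objective prefers chevaux; real scores would of course come from trained models.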

We have implemented a prototype version of TransTalk that 
operates in an isolated-word dictation mode over a vocabulary
of 20,000 French word forms. It is specialized for the domain 
of Canadian Parliamentary debates, which are transcribed in
bilingual form in the Canadian Hansard corpus. Two years 
of Hansard transcripts (approximately 10M French words and 
10M English words) were used as training data for the trans- 
lation model. 

2 Acoustic model 

We use an HMM based on context-independent phone models 
to describe p(s | f). The TransTalk vocabulary is represented
with a set of 47 phonemes including 20 vowels and 27 consonants.
The base pronunciations were obtained using a set of
grapheme-to-phoneme rules which take into account phonetic 
particularities found in the French spoken in Quebec such as 
assibilation and vowel laxing. 

Recognition is performed with an n-best search of a compressed
phonetic graph representing the entire 20,000 word vocabulary.
This graph is such that no two paths produce the
same phone sequence and every path corresponds to a valid 
phonetic representation in the dictionary. A given path will 
therefore correspond to all lexicon entries sharing the same 
phonetics. The search yields a list containing the 20 most 
acoustically probable words for each (isolated) acoustic token. 
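
The property that one path stands for every lexicon entry with the same phonetics can be sketched with a simple prefix tree over phone sequences. This is only an approximation of the idea: the actual compressed graph is not specified here, and the phone symbols, the toy lexicon and the function names below are hypothetical.

```python
def build_phone_trie(lexicon):
    """Build a prefix tree over phone sequences.

    Each node records the words whose pronunciation ends at that node,
    so homophones share a single path.
    """
    root = {"children": {}, "words": []}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node["children"].setdefault(
                phone, {"children": {}, "words": []}
            )
        node["words"].append(word)
    return root


def lookup(trie, phones):
    """Return all words whose pronunciation is exactly `phones`."""
    node = trie
    for phone in phones:
        node = node["children"].get(phone)
        if node is None:
            return []
    return node["words"]


# Toy lexicon with made-up phone symbols; three entries are homophones.
toy_lexicon = {
    "cent": ("s", "an"),
    "sans": ("s", "an"),
    "sang": ("s", "an"),
    "chat": ("sh", "a"),
}
trie = build_phone_trie(toy_lexicon)
homophones = lookup(trie, ("s", "an"))
```

Here a single path ("s", "an") yields all three homophones, so the acoustic search only has to score that phone sequence once.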

3 Translation model 

The aim of the translation model is to describe p(f | e), the
probability that a translator will produce a French translation 
f for an English sentence e. 

3.1 Modelling Approaches 

There are at least two distinct approaches to modelling this 
distribution. Brown et al. expand it as the product
p(f) · p(e | f), to which it is proportional under maximization
over f. The main advantage of this arrangement is that it
provides for a division of labour in which p(f) is responsible
for the well-formedness of f, and p(e | f) for ensuring that e
and f are acceptable translations without having to be unduly 
preoccupied with the internal structure of either. Although 
this is a powerful technique, it has one drawback that makes 
it unsuitable for our purposes: it does not easily lend itself to 
efficient searches over large sets of French sentence candidates. 

Because of this, we have chosen to model p(f | e) more directly
as a family of parameterized Markov language models
$p_{\lambda(e)}(f)$, where each e specifies a parameter vector $\lambda$, not
necessarily uniquely. This approach presents the challenge of
incorporating information from e in a way that does not interfere
with the language model's knowledge of the structure of
French, particularly for language models that are accurate to
begin with. In the work reported here we have largely avoided
this difficulty by using a fairly weak language model; our aim
is mainly to investigate to what extent the performance of such
a model can be improved without substantially increasing its
low run-time cost.

3.2 Derivation 

The translation model is based on a standard tri-class language 
model conditioned on e. The first key assumption we make is 



La motion est adoptée .

Motion agreed to .

Figure 2: An example of an alignment, one of $5^5$ which are
possible for this sentence pair.

that the sequence c of word classes for f is independent of e, 
which allows us to write: 

\[ p(f \mid e) = \sum_{c} p(c) \cdot p(f \mid c, e) \tag{1} \]

This approximation is motivated by the intuition that e will 
be most informative about the actual words in f, and only 
weakly informative about gross syntactic structure of the sort 
that c captures. Because it is most valid when c consists of 
broad classifications,² we use a minimal set of 15 classes which
correspond to the major grammatical categories (noun, verb, 
etc). 

To incorporate translation information, we suppose, following
Brown et al., that f and e are related via an alignment
(see figure 2) in which each French word is connected to either
a single English word in e or none at all. An alignment can be
represented as a vector a of length |f| which contains, for each
French word, the position in e of the English word to which
it connects, or zero if it is not connected. We assume that
all $A_{f,e}$ possible alignments are equally likely, with probability
$p(a \mid c, e) = 1/A_{f,e}$, so we have:³

\[ p(f \mid c, e) = \sum_{a} \frac{1}{A_{f,e}} \cdot p(f \mid a, c, e) \tag{2} \]

This is a rough approximation which runs contrary to our 
knowledge that some alignments — such as those in which all 
French words connect to a single English word, or those in 
which French verbs connect only to English prepositions — will 
be much less likely than others. Its purpose is simplification, 
and we justify it on the grounds that a reasonable model for 
p(f | a, c, e) will minimize the contribution from (most) poor 
alignments in any case. 

The final step is to assume that the words in f are condition- 
ally independent given a, c, and e, and furthermore that each 
word depends only on its class and the English word to which 
it connects in the alignment: 

\[ p(f \mid a, c, e) = \prod_{i=1}^{|f|} p(f_i \mid c_i, e_{a_i}) \tag{3} \]

Our complete model is a Markov source (see figure 3) which
depends on two sets of parameters: contextual parameters of
the form $p(c_i \mid c_{i-2}, c_{i-1})$, which predict a class from its two
predecessors; and bi-lexical parameters of the form $p(f \mid c, e)$, which
predict a French word from its class and its English partner.

It is possible to rearrange the straightforward combination 
of equations 1, 2, and 3 in a way which permits more efficient 

2 This assumption becomes increasingly untenable for finer classi- 
fication schemes; in the limit when classes are identical to words, the 
model collapses into a pure tri-gram with no translation component 
whatsoever. 

3 Where $A_{f,e} = (|e| + 1)^{|f|}$.
Figure 3: The structure of the Markov source underlying the
translation model. First, c is established by choosing each class 
based on the previous two with probability given by the appro- 
priate contextual parameter. Next a is established by picking a 
position in e at random for each position in c. Finally, f is gen- 
erated by choosing each word based on its class and its English 
partner, with probability given by the appropriate bi-lexical 
parameter. 



calculations. The key observation is that the sum over all align- 
ments can be reorganized into a product of sums over English 
words. The result is the equation 



\[ p(f \mid e) = \sum_{c} \prod_{i=1}^{|f|} p(c_i \mid c_{i-2}, c_{i-1}) \, p(f_i \mid c_i, e) \tag{4} \]

where $p(f_i \mid c_i, e) = \sum_{j=0}^{|e|} p(f_i \mid c_i, e_j) / (|e| + 1)$. From this it
should be obvious that our translation model is nothing more
than a standard tri-class model in which the lexical parameters
p(f | c) have been replaced by p(f | c, e).
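
The rearrangement of the sum over alignments into a product of sums can be checked numerically on the sentence pair of figure 2. The sketch below compares, for one fixed class sequence, the brute-force average over all alignments (equations 2 and 3) with the factored form. The bi-lexical values in `toy_t`, the class labels and the function names are hypothetical, and position 0 stands for the empty connection.

```python
import itertools
import math


def lexical_brute_force(f_words, classes, e_words, t):
    """Average over every alignment a of prod_i t(f_i, c_i, e_{a_i})."""
    e_ext = [None] + list(e_words)  # index 0 is the empty (null) connection
    total = 0.0
    for a in itertools.product(range(len(e_ext)), repeat=len(f_words)):
        total += math.prod(
            t(f, c, e_ext[j]) for f, c, j in zip(f_words, classes, a)
        )
    return total / len(e_ext) ** len(f_words)


def lexical_factored(f_words, classes, e_words, t):
    """Product over French positions of the average of t over English words."""
    e_ext = [None] + list(e_words)
    return math.prod(
        sum(t(f, c, e) for e in e_ext) / len(e_ext)
        for f, c in zip(f_words, classes)
    )


def toy_t(f, c, e):
    """Hypothetical bi-lexical parameters p(f | c, e) with a small default."""
    table = {
        ("motion", "NOUN", "Motion"): 0.6,
        ("adoptée", "VERB", "agreed"): 0.5,
    }
    return table.get((f, c, e), 0.01)


french = ["La", "motion", "est", "adoptée", "."]
tags = ["DET", "NOUN", "VERB", "VERB", "PUNC"]  # illustrative class sequence
english = ["Motion", "agreed", "to", "."]

brute = lexical_brute_force(french, tags, english, toy_t)    # 5**5 alignments
factored = lexical_factored(french, tags, english, toy_t)    # 5 * 5 terms
```

The two values agree up to floating-point rounding, but the factored form needs a number of operations proportional to |f| · (|e| + 1) rather than (|e| + 1)^{|f|}.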



3.3 Parameter Estimation 

The two families of parameters in the translation model were 
estimated separately. Contextual parameters were estimated 
as part of a pure tri-class language model for French, which 
was trained on the French half of our bilingual corpus via the 
EM algorithm, using a dictionary to identify valid classes for 
each word. 

Bi-lexical parameters were estimated as part of a simplified 
translation model in which contextual information was assumed 
to be explicit: 



\[ p(f, c \mid e) = \frac{1}{A_{f,e}} \sum_{a} \prod_{i=1}^{|f|} p(f_i, c_i \mid e_{a_i}) \tag{5} \]



To train this model, we first aligned the training corpus to the 
sentence level using the method described in H. To improve 
the quality of our training data, we filtered out alignments 
which involved more than one sentence in either language as 
well as those which contained more than 40 tokens in either 
language — this reduced the size of the training set by approx- 
imately 20%, to about 8M tokens in each language. Next, we 
used the pure language model to tag each word in the French 
part of the reduced corpus with its most likely class. Finally, 
we used the EM algorithm to estimate parameters p(f, c | e) 
from the aligned, tagged corpus. These were transformed into 
bi-lexical parameters as follows: 
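
A minimal sketch of such a transformation, assuming it is ordinary conditioning on the class c (this form is an assumption, with f' ranging over the French vocabulary), would be:

```latex
\[ p(f \mid c, e) = \frac{p(f, c \mid e)}{\sum_{f'} p(f', c \mid e)} \]
```

Under this reading, the values p(f | c, e) for a fixed class and English word sum to one over the French vocabulary.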
Figure 4: A sample of TransTalk's bi-lexical parameters. These
are the ten most probable French words, given the class NOUN 
and the English word government. 

Figure 4 shows a sample of the results.

Because many valid bi-lexical combinations do not occur in 
our training corpus, it was necessary to smooth the bi-lexical 
parameters. Rather than modifying the empirical bi-lexical distribution
p(f | c, e) directly, we chose to dynamically smooth the
more robust sentence-level quantity $p(f_i \mid c_i, e)$ involved in calculations based
on equation 4. We experimented with three simple methods of
combining this with the less precise but more reliable lexical
parameters p(f | c) from the pure language model: linear
interpolation; using the maximum of p(f | c, e) and p(f | c); and
using p(f | c) iff $\max_e p(f \mid c, e) / p(f \mid c, e)$ did not exceed some
threshold. The rationale for the second method is that we 
expect higher probabilities to be more reliably estimated on 
average than lower ones. The third method is intended to re- 
ject translation information when there is no English word that 
is strongly associated with the current French word. Because 
the last two methods result in unnormalized distributions, they
can be compared only in terms of recognition performance and
not by means of the perplexity measure (see section 5).
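
The first two combination methods, and the reason the second one loses normalization, can be sketched as follows. The distributions are invented toy values over a three-word vocabulary, and the function names are hypothetical; the threshold method is left out because its exact criterion is not reproduced here.

```python
def interpolate(p_bilex, p_lex, weight=0.85):
    """Linear interpolation; `weight` applies to the bi-lexical estimate."""
    return {
        f: weight * p_bilex.get(f, 0.0) + (1.0 - weight) * p_lex[f]
        for f in p_lex
    }


def maximum(p_bilex, p_lex):
    """Take the larger of the two estimates for each French word."""
    return {f: max(p_bilex.get(f, 0.0), p_lex[f]) for f in p_lex}


# Hypothetical p(f | c) from the pure language model, for one class.
p_lex = {"gouvernement": 0.2, "motion": 0.3, "chambre": 0.5}
# Hypothetical p(f | c, e) for a source sentence containing "government".
p_bilex = {"gouvernement": 0.9, "motion": 0.05, "chambre": 0.05}

interpolated = interpolate(p_bilex, p_lex)
maxed = maximum(p_bilex, p_lex)

interpolated_mass = sum(interpolated.values())  # stays at 1 (up to rounding)
maxed_mass = sum(maxed.values())                # exceeds 1: not a distribution
```

Interpolating two normalized distributions gives a normalized distribution, so perplexity is well defined; taking the maximum generally inflates the total mass, which is why that method can only be judged by recognition performance.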

4 Search 

The aim of the search component is to find an approximation 
to the sentence f that maximizes the product of acoustic and 
translation scores p(s | f) • p(f | e). Our search algorithm is di- 
vided into two stages, both of which are suboptimal. 

The first stage involves using the acoustic model to prune the 
list of word hypotheses for each acoustic token from 20,000 to 
some number n (currently 20). Since this pruning is performed 
without reference to the translation model, there is no guarantee
that $\hat{f}$ is among the $n^{|f|}$ sentence candidates retained.

The second stage is a Viterbi search through the remaining 
sentence candidates using the translation model. This permits 
us to find the pair $(\tilde{f}, \tilde{c})$ that maximizes the product $p(s \mid f) \cdot
p(f, c \mid e)$ in time which is proportional to $n C^2 |f| |e|$, where C is
the number of word classes in the translation model (currently
15). Given the coarse nature of our word classes, we feel that
$\tilde{f}$ is a reasonable approximation to $\hat{f}$.
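
The second stage can be sketched as a Viterbi search whose states are pairs of consecutive classes. The sketch below is only one possible arrangement, not the prototype's code: the function signature, the start symbol, the two-class inventory and all probabilities are hypothetical, and `log_lex` is assumed to return the already smoothed log p(f | c, e) for the current source sentence.

```python
import math


def viterbi(candidates, classes, log_ctx, log_lex):
    """Find the best (words, classes) pair under a tri-class model.

    candidates[i]: list of (word, acoustic log-prob) pairs for token i.
    log_ctx(c2, c1, c): log p(c | c2, c1).
    log_lex(w, c): log p(w | c, e) for the current source sentence.
    """
    start = "<s>"
    # best[(c_prev, c)] = (score, words, classes) of the best path so far.
    best = {(start, start): (0.0, [], [])}
    for cands in candidates:
        # Given its class, the best word at a position is chosen independently.
        emit = {
            c: max((ac + log_lex(w, c), w) for w, ac in cands) for c in classes
        }
        new_best = {}
        for (c2, c1), (score, words, cs) in best.items():
            for c in classes:
                e_score, w = emit[c]
                total = score + log_ctx(c2, c1, c) + e_score
                if (c1, c) not in new_best or total > new_best[(c1, c)][0]:
                    new_best[(c1, c)] = (total, words + [w], cs + [c])
        best = new_best
    score, words, cs = max(best.values(), key=lambda item: item[0])
    return words, cs, score


# Toy setup: two classes, uniform contextual parameters, invented lexical values.
CLASSES = ["NOUN", "OTHER"]
LEX = {
    ("tes", "OTHER"): 0.5,
    ("thé", "NOUN"): 0.01,
    ("chevaux", "NOUN"): 0.3,
    ("cheveux", "NOUN"): 0.02,
}


def toy_log_ctx(c2, c1, c):
    return math.log(0.5)


def toy_log_lex(w, c):
    return math.log(LEX.get((w, c), 1e-6))


nbest = [
    [("tes", math.log(0.9)), ("thé", math.log(0.1))],
    [("cheveux", math.log(0.55)), ("chevaux", math.log(0.45))],
]
words, tags, score = viterbi(nbest, CLASSES, toy_log_ctx, toy_log_lex)
```

In this toy run the acoustically less likely chevaux is selected because its lexical score outweighs the acoustic difference. Storing whole paths in each state keeps the sketch short; a real implementation would keep back-pointers instead.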

5 Results 

We tested TransTalk on a small corpus of 50 French/English 
sentence pairs from the Hansard corpus which were not used as 
training data. The French sentences were all between 15 and 20 
Figure 5: Comparison of language model (LM) and translation
model (TM) results for a sentence pair (F,E) from the test 
corpus. (This pair has been truncated for space reasons.) Lines 
indicate salient parts of the most probable alignment between 
the output sentence and E. The presence of equity in the English 
source allowed the translation model to correctly choose équité
instead of qualité.



Model                 Words Correct (/918)          Perplexity
                      Speaker 1      Speaker 2
pure language         686 (74.7%)    677 (73.8%)    385
interpolated (.85)    735 (80.1%)    734 (80.0%)    180
maximum               735 (80.1%)    732 (79.7%)
e-testing (.30)       742 (80.8%)    734 (80.0%)

Figure 6: Summary of TransTalk results. The first line contains
statistics for the pure language model; the remaining
lines contain statistics for the translation model with each of
the three different smoothing methods described in section 3.3
(where .85 was the optimum weighting factor for bi-lexical
parameters, and .30 was the optimum confidence threshold).



tokens in length (counting punctuation) and were selected so as 
not to contain words outside our 20,000 word vocabulary. They 
were dictated in isolated-word mode by two different speakers. 

Figure 5 illustrates the results for a single sentence pair.
Overall statistics are given in figure 6. The translation model
yielded an average error-rate decrease of 24% over the pure
language model. For errors which involved "content" words
(e.g., action for section) the decrease was 42%. The perplexity
of the test corpus was reduced by more than half by the use of 
the translation model. 
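
The error-rate figure can be related to the word counts in figure 6 with a short calculation. The sketch below assumes that the decrease is relative (errors removed divided by baseline errors) and averaged over the two speakers; that reading is an assumption, and the function name is hypothetical.

```python
TOTAL = 918  # words per speaker in the test corpus

# Words correct for (speaker 1, speaker 2), taken from figure 6.
correct = {
    "pure language": (686, 677),
    "interpolated": (735, 734),
    "maximum": (735, 732),
    "e-testing": (742, 734),
}


def error_reduction(model, baseline="pure language"):
    """Relative error-rate decrease, averaged over the two speakers."""
    reductions = []
    for base_ok, model_ok in zip(correct[baseline], correct[model]):
        base_err = TOTAL - base_ok
        model_err = TOTAL - model_ok
        reductions.append((base_err - model_err) / base_err)
    return sum(reductions) / len(reductions)


e_testing_gain = error_reduction("e-testing")
interpolated_gain = error_reduction("interpolated")
```

Under this reading the e-testing row gives a decrease of about 24%, and the interpolated row about 22%.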
6 Conclusions

Our initial results demonstrate that it is possible to cheaply 
and effectively make use of translation information for speech 
recognition. We feel that the simple approach described in 
this paper barely begins to tap the potential of the TransTalk 
idea, and we are currently investigating a number of ways of 
improving on it. 